Applying Deep Belief Networks to Word Sense Disambiguation 

Abstract 

In this paper, we applied a novel learning
algorithm, namely Deep Belief Networks (DBN),
to word sense disambiguation (WSD). DBN is a
probabilistic generative model composed of
multiple layers of hidden units. DBN is
pretrained greedily, layer by layer, using
Restricted Boltzmann Machines (RBM). A separate
fine-tuning step is then employed to improve
the discriminative power. We compared DBN with
various state-of-the-art supervised learn- 
ing algorithms in WSD such as Support 
Vector Machine (SVM), Maximum En- 
tropy model (MaxEnt), Naive Bayes clas- 
sifier (NB) and Kernel Principal Com- 
ponent Analysis (KPCA). We used all 
words in the given paragraph, surrounding 
context words and part-of-speech of sur- 
rounding words as our knowledge sources. 
We conducted our experiment on the 
SENSEVAL-2 data set. We observed that 
DBN outperformed all other learning al- 
gorithms. 

1 Introduction 

A major difficulty in Natural Language Processing
is to automatically resolve the many ambiguities
that arise in human language, for instance lexical
ambiguity. When a polyseme is used in a sentence,
it can be difficult even for a human reader to pin
down its intended meaning, especially when the
document contains many polysemes. For example,
"snow leopard" can refer to either an animal or a
Macintosh operating system. However, we can look
at the surrounding words to guess the meaning.
Because this guess relies only on context, a
machine can help us disambiguate the polysemes in
a document.

Word sense disambiguation (WSD) is the task of
computationally identifying the appropriate meaning
s, from a given set of meanings S, for a word w
in a given context c. WSD is considered to be a
fundamental task to achieve a high performance 
in Machine Translation (MT). Other applications 
of WSD include Information Retrieval (IR), Infor- 
mation Extraction (IE) and text mining. 

There are four approaches to WSD: knowledge-based
methods, unsupervised corpus-based methods,
supervised corpus-based methods, and combinations
of these approaches. In this paper, we focus on
the supervised corpus-based approach, which has
consistently been observed to achieve the highest
performance. A supervised approach starts by
building feature vectors and then applies a
learning algorithm to those vectors to train a
classifier.

Feature vectors can be constructed from the text
in which the word w occurs. To begin with, the
correct sense of the word w in each context is
manually tagged and used as a label. Then,
knowledge sources such as part-of-speech or local
bigrams are used to build the features.
Consequently, we get one feature vector for each
context, and these vectors are used as a training
set to train a classifier for each word w. An
official competition is held once every three
years. The data sets from this competition are
SENSEVAL-1, SENSEVAL-2, SENSEVAL-3, SEMEVAL-1 and
SEMEVAL-2. In this paper, we considered the
English lexical sample task of SENSEVAL-2, which
has 73 word tasks including tasks for nouns, verbs
and adjectives. SENSEVAL-2 used WordNet 1.7 to
label the data. There are 75 to 300 instances in
each word task.

* This work started when this author was at the
Interdisciplinary Graduate School of Science and
Engineering, Tokyo Institute of Technology.

The goal of a learning algorithm is to predict an
unseen example correctly using knowledge from
previously seen examples. Until now, the learning
algorithms that have been shown to work well in
WSD are Naive Bayes (NB), Nearest Neighbors (NN),
and Support Vector Machine (SVM). However, those
are all 'shallow' learning algorithms, meaning
that they do not contain nonlinearity complex
enough to model human behavior. Shallow learning
algorithms may be effective for building a simple
system. For example, one may succeed on a given
problem with a lot of human effort in feature
engineering, but the resulting system will be task
specific and cannot be reused for a new problem
even if that problem is similar to the previous
one. In addition, the feature vectors that serve
as input to shallow learning algorithms are sparse
and suffer from the curse of dimensionality.

Deep learning algorithms aim at learning feature
hierarchies in which higher level features are
formed by the composition of lower level features.
Although the features are constructed in a
recursive manner, each feature level represents a
different level of abstraction. This is important
for extracting higher level abstractions that
humans cannot specify explicitly. Thus, deep
learning may be used in addition to typical
feature engineering for natural language
processing, giving the system more coverage
because a feature extractor obtained from deep
learning can be generalized to similar problems.
Many deep learning algorithms have been proposed;
this paper investigates the behavior of Deep
Belief Networks (DBN).

In this paper, we conducted an experiment to
compare various shallow learning algorithms with
DBN on basic features of the SENSEVAL-2 English
lexical sample data set.

2 Related Works 

Lee and Ng (2002) evaluated various learning
algorithms with many knowledge sources and
reported that a linear Support Vector Machine
(SVM) is the best classifier. Escudero (2006) also
investigated the effectiveness of the linear SVM.
Wu et al. (2004) introduced Kernel Principal
Component Analysis (KPCA) with polynomial kernels
to find a nonlinear combination of features for
classifiers. The results showed that Naive Bayes
(NB), Maximum Entropy (MaxEnt) and SVM all
achieved better performance.

Several works apply neural networks to WSD.
Cottrell (1989) was the first to propose neural
networks for WSD. However, Towell and Voorhees
(1998) argued that neural networks without a
hidden layer perform better. This agrees with the
previous finding that WSD data is likely to be
linear and sparse, so that a linear SVM would be
the best classifier for WSD.

Recently, deep learning algorithms have
consistently shown interesting results over
shallow algorithms in many natural language
processing tasks. Collobert and Weston (2008)
proposed a deep neural network architecture which
can be applied to part-of-speech tagging,
chunking, named entity recognition and semantic
role labeling simultaneously. The proposed
architecture learns an internal representation and
shares that representation as a feature among
tasks. Mnih and Hinton (2008) proposed a deep
neural network language model which outperforms
non-hierarchical neural models and n-gram language
models. Other successes include machine
transliteration (Deselaers et al., 2009),
sentiment analysis (Zhou et al., 2010; Glorot et
al., 2011), question answering (Wang et al.,
2010), named entity recognition (Chen et al.,
2010), relation extraction (Chen et al., 2010) and
parsing (Socher et al., 2010). As far as we know,
recent advances in deep learning have not yet been
investigated for WSD.

In WSD, it is empirically shown that linear 
SVM works best so the structure of the data seems 
to be linear. However, this work will address 
the possibility that WSD data may be nonlinear 
and the performance can be improved when using 
deep learning algorithms even if the number of in- 
stances per class is small and the feature vector is 
highly sparse. 



3 Knowledge Sources 

3.1 Topical Feature 

We collected all unigrams in the provided context,
regardless of which sentence they appeared in, and
encoded them as a binary bag-of-words feature
vector. We used the word segmentation module and
Porter stemmer module from NLTK (Bird et al.,
2009) for preprocessing. We also used the stop
word list from NLTK to remove stop words. This
type of feature captures the general topic of the
text, based on the intuition that words belonging
to the same topic usually occur together.
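
The encoding described above can be sketched as follows. This is only an illustration, not the code used in the experiment: the regular-expression tokenizer and the tiny `STOP_WORDS` set are hypothetical stand-ins for the NLTK segmentation and stop word modules, and stemming is omitted.

```python
import re

# Hypothetical miniature stop list; the experiment used NLTK's list instead.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is"}

def tokenize(text):
    """Lowercase, split on non-letters and drop stop words (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_vocabulary(contexts):
    """Map every unigram seen in the contexts to a column index."""
    vocab = sorted({tok for ctx in contexts for tok in tokenize(ctx)})
    return {tok: i for i, tok in enumerate(vocab)}

def topical_vector(context, vocab):
    """Binary bag-of-words: 1 if the unigram occurs anywhere in the context."""
    vec = [0] * len(vocab)
    for tok in tokenize(context):
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec

contexts = [
    "The bank of the river was muddy.",
    "She went to the bank to deposit money.",
]
vocab = build_vocabulary(contexts)
vectors = [topical_vector(c, vocab) for c in contexts]
```

Both toy contexts share the column for "bank" and differ in the remaining columns, which is the topical signal a classifier can use.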

3.2 Local Feature 

We specified the size of a window around the
target word w to be disambiguated. The window
yields the words before and after the word w. The
typical window size is between 3 and 10 words.
This feature type encodes the position of words in
the local vicinity. We included local unigrams,
bigrams and trigrams when constructing the feature
vector. For example, from the phrase "cross the
river", we get three unigrams (cross, the, river),
two bigrams (cross_the, the_river) and one trigram
(cross_the_river). We used the word segmentation
module from NLTK (Bird et al., 2009) for
preprocessing. Finally, we obtained a binary
feature vector representing the local feature. In
the experiment, we found that a window size of 7
yielded the best performance.
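
A minimal sketch of the n-gram extraction in the example above. The function name `local_ngrams` is hypothetical, and the sketch assumes that the window size counts tokens on each side of the target word, which the text does not state. The positional encoding mentioned above is left out for brevity.

```python
def local_ngrams(tokens, target_index, window=7, max_n=3):
    """Collect unigrams, bigrams and trigrams inside the window around the target.

    Assumption: `window` is the number of tokens kept on each side of the target.
    """
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    span = tokens[lo:hi]
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(span) - n + 1):
            features.append("_".join(span[i:i + n]))
    return features

feats = local_ngrams(["cross", "the", "river"], target_index=2)
```

For the phrase "cross the river" this yields the three unigrams, two bigrams and one trigram listed in the text; each distinct n-gram would then become one binary column.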

3.3 Part-of-speech Feature 

We used the part-of-speech tagger module from
NLTK (Bird et al., 2009) to tag all unigrams in
the specified window. Then, we encoded them as
binary features which represent the part-of-speech
tag at each word position. For instance, assume
that we have four part-of-speech tags, NN (Noun),
VB (Verb), ADJ (Adjective) and DT (Determiner). If
we have the tagged phrase "cross/VB the/DT
river/NN", the feature vector will be
(0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0). The first
four digits encode the part-of-speech of the word
"cross", and so on. For this feature, we used only
the words in the same sentence as the target word
w.
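
The positional one-hot encoding of the example can be reproduced with a few lines. The names `TAGSET` and `pos_feature` are hypothetical, and the tag order (NN, VB, ADJ, DT) is taken from the order in which the tags are listed above.

```python
# Tag order follows the example in the text: NN, VB, ADJ, DT.
TAGSET = ["NN", "VB", "ADJ", "DT"]

def pos_feature(tagged_words):
    """Concatenate one one-hot block per word position."""
    vec = []
    for _, tag in tagged_words:
        block = [0] * len(TAGSET)
        block[TAGSET.index(tag)] = 1
        vec.extend(block)
    return vec

vec = pos_feature([("cross", "VB"), ("the", "DT"), ("river", "NN")])
```

The result is the 12-dimensional vector given in the text, with one block of four digits per word.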

4 Learning Algorithms 

We evaluated the following seven learning
algorithms for comparison with Deep Belief
Networks: Naive Bayes, Nearest Neighbors,
Principal Component Analysis, Kernel Principal
Component Analysis, Logistic Regression (MaxEnt),
Multilayer Perceptron and Support Vector Machine.
In this section, we denote a data instance by x
and a label instance by y. X and Y are matrices in
which each column is a data instance x or a label
instance y, respectively.

4.1 Naive Bayes 

Naive Bayes (NB) is a simple learning algorithm
which applies Bayes' rule under the assumption
that all features are conditionally independent
given the class. NB chooses the class with the
highest posterior probability as its prediction.
In the experiment, we used the Naive Bayes module
from NLTK (Bird et al., 2009) with Laplace
(add-one) smoothing.

4.2 Nearest Neighbor 

A Nearest Neighbor classifier (NN) classifies an
instance by choosing the closest training example
in the feature space. NN is often regarded as lazy
learning since computation is done only at
classification time. A k-Nearest Neighbor
algorithm takes a majority vote among the k
nearest neighbors. However, NN becomes extremely
slow when the data have many instances or many
dimensions. In the experiment, we used the Nearest
Neighbor algorithm from scikits.learn^ and set the
parameter k to 1.

4.3 Principal Component Analysis 

Principal Component Analysis (PCA) is a
dimensionality reduction technique. PCA maps data
points to a lower-dimensional feature space while
preserving as much variance as possible. PCA
solves the eigenvalue problem of the covariance
matrix C of the zero-mean data to find
eigenvectors V, ordered by descending magnitude of
the corresponding eigenvalues λ, and uses them as
bases for projection.

$$C = \mathrm{cov}(x) \tag{1}$$

$$CV = \lambda V \tag{2}$$

The target feature space usually has a very small
dimension compared to the original feature space.
In the experiment, we implemented PCA using NumPy
and SciPy (Jones et al., 2001). We set the target
dimension to 30 and used 1-NN as the classifier.
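
Equations (1) and (2) translate almost directly into NumPy. The following is a sketch under the column-per-instance convention of this section, not the implementation used in the experiment; the function names and the toy data are hypothetical.

```python
import numpy as np

def pca_fit(X, k):
    """X holds one instance per column. Returns the mean and the top-k eigenpairs."""
    mean = X.mean(axis=1, keepdims=True)
    C = np.cov(X - mean)                      # Eq. (1); rows are treated as variables
    eigvals, eigvecs = np.linalg.eigh(C)      # Eq. (2); eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:k]     # keep the k largest eigenvalues
    return mean, eigvecs[:, order], eigvals[order]

def pca_project(X, mean, V):
    """Project zero-mean data onto the chosen eigenvectors."""
    return V.T @ (X - mean)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 40))                  # toy data: 5 features, 40 instances
mean, V, lam = pca_fit(X, k=2)
Z = pca_project(X, mean, V)
```

The variance of each projected coordinate equals the corresponding eigenvalue, which is why ordering by eigenvalue preserves as much variance as possible.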

^http://scikit-learn.sourceforge.net 



4.4 Kernel Principal Component Analysis 

Kernel Principal Component Analysis (KPCA)
extends PCA to nonlinearity. KPCA introduces the
kernel trick, in which the data is mapped to a
reproducing kernel Hilbert space (RKHS), a
convenient way to model nonlinearity through an
implicit mapping. KPCA computes the kernel matrix
K using the kernel function k.

$$K_{ij} = k(x_i, x_j) \tag{3}$$

Matrix K is double-centered by the following
equation, where n is the number of training
instances:

$$\tilde{K}_{ij} = K_{ij} - \frac{1}{n}\sum_{p} K_{pj} - \frac{1}{n}\sum_{q} K_{iq} + \frac{1}{n^2}\sum_{p,q} K_{pq} \tag{4}$$

Then, KPCA solves the eigenvalue problem as in
PCA, using the double-centered matrix.

$$\tilde{K}V = \lambda V \tag{5}$$

The bases for projection are the eigenvectors
scaled by the inverse square root of their
corresponding eigenvalues.

$$\alpha_i = \frac{1}{\sqrt{\lambda_i}} v_i \tag{6}$$

Note that the test data need to be double-centered
with the training data before projection.

In the experiment, we implemented KPCA using NumPy
and SciPy (Jones et al., 2001). We set the target
dimension to 30 and used 1-NN as the classifier.
We experimented with Gaussian RBF and polynomial
kernels.
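
The steps above (kernel matrix, double centering, eigendecomposition, scaling, and centering of test data against the training data) can be sketched in NumPy. This is an illustration under assumptions, not the code used in the experiment: the function names are hypothetical, a Gaussian RBF kernel with an arbitrary gamma is used, and instances are stored one per column.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel between the columns of A and the columns of B."""
    sq_dist = (A ** 2).sum(axis=0)[:, None] + (B ** 2).sum(axis=0)[None, :] - 2.0 * A.T @ B
    return np.exp(-gamma * sq_dist)

def kpca_fit(X, k, gamma=1.0):
    n = X.shape[1]
    K = rbf_kernel(X, X, gamma)
    ones = np.full((n, n), 1.0 / n)
    K_centered = K - ones @ K - K @ ones + ones @ K @ ones   # double centering
    lam, V = np.linalg.eigh(K_centered)                      # ascending eigenvalues
    top = np.argsort(lam)[::-1][:k]
    alpha = V[:, top] / np.sqrt(lam[top])                    # scale by 1/sqrt(eigenvalue)
    return K, alpha, lam[top]

def kpca_project(X_train, X_new, K_train, alpha, gamma=1.0):
    """Center the new kernel rows with the training statistics, then project."""
    n = X_train.shape[1]
    K_new = rbf_kernel(X_new, X_train, gamma)
    ones_n = np.full((n, n), 1.0 / n)
    ones_m = np.full((K_new.shape[0], n), 1.0 / n)
    K_new_centered = K_new - ones_m @ K_train - K_new @ ones_n + ones_m @ K_train @ ones_n
    return K_new_centered @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 20))                  # toy data: 3 features, 20 instances
K, alpha, lam = kpca_fit(X, k=2, gamma=0.5)
Z = kpca_project(X, X, K, alpha, gamma=0.5)   # one projected instance per row
```

Projecting the training data itself gives uncorrelated coordinates whose squared norms equal the retained eigenvalues, a quick sanity check on the centering and scaling.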

4.5 Logistic Regression 

Logistic Regression applies the technique of
linear regression to the classification problem in
a probabilistic way. Logistic Regression can be
considered an instance of the Maximum Entropy
model (MaxEnt). Logistic Regression is trained to
minimize the error of the prediction

$$y_{predict} = \arg\max_i P(Y = i \mid x, W, b), \tag{7}$$

where W is the weight matrix and b is the bias.
The probability of each class is the value of the
softmax function of the input.

$$P(Y = i \mid x, W, b) = \frac{e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}} \tag{8}$$

In the experiment, we employed Logistic Regression
from Theano (Bergstra et al., 2010), which used
Stochastic Gradient Descent (SGD) to optimize the
loss function. We fixed the learning rate at 0.13
for SGD.
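
A small NumPy sketch of the softmax prediction and of one SGD step may make the two formulas concrete. It is a stand-in for the Theano implementation, not a copy of it: the function names are hypothetical, and the loss is assumed to be the negative log-likelihood of the correct class.

```python
import numpy as np

def softmax_probs(x, W, b):
    """P(Y = i | x, W, b) for every class i (softmax of W x + b)."""
    z = W @ x + b
    z = z - z.max()              # stabilises the exponentials without changing the ratios
    e = np.exp(z)
    return e / e.sum()

def predict(x, W, b):
    """Class with the highest probability."""
    return int(np.argmax(softmax_probs(x, W, b)))

def sgd_step(x, y, W, b, lr=0.13):
    """One SGD step on the negative log-likelihood of a single example (assumed loss)."""
    grad_logits = softmax_probs(x, W, b)
    grad_logits[y] -= 1.0        # derivative of the loss with respect to W x + b
    return W - lr * np.outer(grad_logits, x), b - lr * grad_logits

x = np.array([1.0, 0.0, 1.0, 0.0])            # toy binary feature vector
W, b = np.zeros((3, 4)), np.zeros(3)          # three senses, four features
p_before = softmax_probs(x, W, b)
W1, b1 = sgd_step(x, 2, W, b)
p_after = softmax_probs(x, W1, b1)
```

With zero weights all senses are equally likely; after one step towards sense 2 its probability rises above one third.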



4.6 Multilayer Perceptron 

Multilayer Perceptron (MLP) is a feedforward
artificial neural network model. MLP consists of
an input layer, a hidden layer and an output
layer. The feature vector is viewed as the input
layer, where each feature corresponds to an input
node. The hidden layer learns a transformation of
the input feature vector. The output layer takes
the output of the hidden layer as its input and
acts as a classifier. MLP uses the backpropagation
algorithm (BP) for learning. BP adjusts the
weights with respect to the gradient of an error
measure. The error at the output units is computed
first and is then propagated backward through all
layers. In the experiment, we used the MLP from
Theano (Bergstra et al., 2010) with one hidden
layer of 1,000 nodes and used Stochastic Gradient
Descent (SGD) to optimize the loss function. We
fixed the learning rate at 0.01 for SGD.




Figure 1: The architecture of Multilayer
Perceptron (input layer, hidden layer, output
layer).

4.7 Support Vector Machine 

Support Vector Machine (SVM) finds an optimal
separating hyperplane for two-class
classification, where the margin is made as wide
as possible by using quadratic programming. If the
two classes are non-separable, the parameter C is
used to control the tradeoff between the width of
the margin and the training error as follows:

$$\min \; \frac{1}{2}\|W\|^2 + C \sum_{i=1}^{n} \epsilon_i \tag{9}$$

$$\text{s.t.} \quad y_i(\langle W, x_i \rangle + b) \ge 1 - \epsilon_i \quad \text{and} \quad \epsilon_i \ge 0 \;\; \forall i. \tag{10}$$

SVM may be extended by the kernel trick to support
nonlinearity in the data set in the same sense as
KPCA, but in the dual form of the SVM optimization
problem.

In the experiment, we used SVM from scikits.learn,
which calls LIBSVM (Chang and Lin, 2011) and
LIBLINEAR (Fan et al., 2008). We set the parameter
C to 1. For nonlinear SVM, we used a third-order
polynomial kernel and a Gaussian RBF kernel with
the parameter γ (gamma) set to 3.
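
To make the role of C and of the slack variables concrete, the sketch below evaluates the soft-margin objective for a fixed hyperplane. It does not solve the quadratic program, and it is not the LIBSVM or LIBLINEAR code path. The function name and toy data are hypothetical, and instances are stored one per row here for brevity.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    """Objective value when every slack takes its smallest feasible value.

    X holds one instance per row and y contains labels in {-1, +1}.
    """
    margins = y * (X @ w + b)
    slack = np.maximum(0.0, 1.0 - margins)    # smallest slack satisfying both constraints
    return 0.5 * w @ w + C * slack.sum(), slack

X = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
objective, slack = soft_margin_objective(np.array([1.0, 0.0]), 0.0, X, y, C=1.0)
```

Points outside the margin on the correct side get zero slack, while the second and fourth points pay a penalty; a larger C makes such violations more expensive relative to the margin width.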

5 Deep Belief Networks 

Deep Belief Networks (DBN) (Hinton, 2006) are
graphical models which extract a hierarchical
representation from the data. DBN consists of
multiple layers of binary stochastic latent
variables. Learning in DBN starts by greedily
learning the features one layer at a time using a
kind of Markov Random Field (MRF) called
Restricted Boltzmann Machine (RBM). The learned
hidden layer is used as the input layer for the
next layer, recursively. The objective of this
phase is to find a good parameter set for DBN,
which serves as the initial parameter set for the
second phase, in which all layers are fine-tuned
with the backpropagation algorithm (BP) to improve
discriminative power. The second phase adjusts all
parameters in all layers.




Figure 2: The architecture of Deep Belief Net- 
works. 

In the experiment, we used DBN from Theano
(Bergstra et al., 2010) and used Stochastic
Gradient Descent (SGD) to optimize the loss
function. We used held-out cross validation to
tune the parameters. We fixed the number of
pretraining iterations to 25 for all word tasks
and determined the number of finetuning iterations
by cross validation. We fixed the pretraining rate
to 0.1 and the finetuning rate to 1. The best
architecture had three hidden layers with 100
hidden nodes in each layer.

5.1 Restricted Boltzmann Machine 

Restricted Boltzmann Machine (RBM) has one visible
layer v and one hidden layer h. Edges, with
weights W, connect only nodes in different layers.
This makes RBM a bipartite graph.

Figure 3: The architecture of Restricted Boltzmann
Machine.

Moreover, this also makes the hidden nodes
independent of each other given the visible nodes,
and the visible nodes independent of each other
given the hidden nodes.

$$p(v \mid h) = \prod_i p(v_i \mid h) \tag{11}$$

$$p(h \mid v) = \prod_j p(h_j \mid v) \tag{12}$$

So, when a data vector v is given, we can quickly
get an unbiased sample from the posterior
distribution over h. This eliminates the
explaining away effect found in graphical models.
RBM tries to model h by reconstructing v with
minimum error. RBM is modeled by an energy-based
function (EB). The energy function can be defined
as

$$E(v, h) = -b_v^\top v - b_h^\top h - h^\top W v, \tag{13}$$

where b_v and b_h are the biases of the visible
layer and the hidden layer, respectively. The
activation function of RBM is as follows.
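
For binary units, the energy function in Eq. (13) leads to logistic conditionals for each node. The following is a sketch of that standard form; the symbol σ for the logistic sigmoid and the row and column notation W_{j·}, W_{·i} are assumptions introduced here for illustration.

```latex
p(h_j = 1 \mid v) = \sigma\left(b_{h,j} + W_{j\cdot}\, v\right), \qquad
p(v_i = 1 \mid h) = \sigma\left(b_{v,i} + h^\top W_{\cdot i}\right), \qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```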

Maximum likelihood learning in RBM can be carried
out by gradient descent, with the gradient
approximated by Gibbs sampling. RBM starts by
taking an input vector as the visible layer. Then,
RBM updates all hidden nodes simultaneously. After
that, RBM reconstructs the visible layer and uses
the reconstruction to update the hidden layer
again. The gradient term is as follows,

$$\Delta W_{ji} = \epsilon\left(\langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{reconstructed}\right), \tag{16}$$

where ε is the learning rate. This Gibbs sampling
can be repeated until convergence. However, this
consumes a lot of computational power. So,
Contrastive Divergence (CD) was proposed by Hinton
(2002) to approximate this process by introducing
KL-divergence, such that performing only a few
steps of Gibbs sampling is enough. It was shown
empirically that a single Gibbs step is
sufficient.

Figure 4: Learning Restricted Boltzmann Machine.
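
The procedure with a single Gibbs step (CD-1) can be sketched in NumPy as below. This is an illustration under assumptions, not the Theano code used in the experiment: the function names are hypothetical, W is stored with one row per hidden node, and reconstruction probabilities are used in place of sampled binary values, a common simplification.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b_v, b_h, rng, lr=0.1):
    """One CD-1 step for a binary RBM; W has shape (n_hidden, n_visible)."""
    ph0 = sigmoid(b_h + W @ v0)                          # p(h = 1 | data)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)     # sample the hidden layer
    pv1 = sigmoid(b_v + W.T @ h0)                        # reconstruct the visible layer
    ph1 = sigmoid(b_h + W @ pv1)                         # hidden layer given the reconstruction
    dW = np.outer(ph0, v0) - np.outer(ph1, pv1)          # data term minus reconstruction term
    return W + lr * dW, b_v + lr * (v0 - pv1), b_h + lr * (ph0 - ph1)

def reconstruct(v, W, b_v, b_h):
    """Deterministic reconstruction, used here only to monitor progress."""
    return sigmoid(b_v + W.T @ sigmoid(b_h + W @ v))

rng = np.random.default_rng(0)
v0 = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])            # toy binary input
W = 0.01 * rng.normal(size=(4, 6))
b_v, b_h = np.zeros(6), np.zeros(4)
err_before = np.abs(v0 - reconstruct(v0, W, b_v, b_h)).mean()
for _ in range(200):
    W, b_v, b_h = cd1_update(v0, W, b_v, b_h, rng)
err_after = np.abs(v0 - reconstruct(v0, W, b_v, b_h)).mean()
```

On this toy input the reconstruction error drops as the updates accumulate, which is the behaviour the reconstruction-based gradient is meant to produce.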



5.2 Pretraining phase 

In WSD, features come from various knowledge 
sources and the classifier usually takes them with- 
out considering the relation among features. Hid- 
den nodes of RBM are the combinations of fea- 
tures which provide a way to model those rela- 
tions. By stacking RBMs, we can learn complex 
relations of knowledge sources. The learned hid- 
den layer will be used as an input vector for an- 
other hidden layer. This phase is unsupervised and 
does not require labels. 
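
Greedy stacking can be sketched as a loop in which each RBM is trained on the hidden activations of the layer below. The code is a self-contained illustration, not the Theano implementation: the function names and toy data are hypothetical, CD-1 is assumed for each layer, hidden probabilities rather than samples are passed upward, and the layer sizes are reduced from the 100 nodes used in the experiment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, rng, epochs=25, lr=0.1):
    """Train one binary RBM with CD-1; `data` holds one instance per row."""
    n_visible = data.shape[1]
    W = 0.01 * rng.normal(size=(n_hidden, n_visible))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(b_h + W @ v0)
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            pv1 = sigmoid(b_v + W.T @ h0)
            ph1 = sigmoid(b_h + W @ pv1)
            W += lr * (np.outer(ph0, v0) - np.outer(ph1, pv1))
            b_v += lr * (v0 - pv1)
            b_h += lr * (ph0 - ph1)
    return W, b_h

def pretrain_stack(data, layer_sizes, rng):
    """Unsupervised layer-wise pretraining; no labels are used anywhere."""
    params, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b_h = train_rbm(layer_input, n_hidden, rng)
        params.append((W, b_h))
        layer_input = sigmoid(layer_input @ W.T + b_h)   # activations feed the next RBM
    return params, layer_input

rng = np.random.default_rng(0)
X = (rng.random((20, 12)) < 0.3).astype(float)           # toy sparse binary features
params, top = pretrain_stack(X, [8, 8, 8], rng)
```

The returned weights and hidden biases would serve as the initial parameters of the network that is fine-tuned in the next phase.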

5.3 Finetuning phase 

After unsupervised pretraining, the current
parameter set is a good starting point for local
search by the backpropagation algorithm (BP). In
pretraining, the search space is smoother and its
optimum lies near that of the finetuning phase,
which helps avoid getting stuck in a poor local
optimum. Backpropagation works better after
pretraining, particularly in a large network:
although the gradient may become small, slight
changes to the weights are enough to obtain a good
model.

6 Empirical Results 

6.1 Data set and Evaluation 

In this paper, we used the SENSEVAL-2 data set
which has 73 word tasks, 8,611 training instances
and 4,328 test instances. All senses were labeled
using WordNet 1.7. Our experiment is based on the
official data set and the fine-grained evaluation
of SENSEVAL-2, which measures system performance
by micro-average recall (mi). Moreover, we
measured significance by performing a two-sample
one-sided t-test between DBN and the other
learning algorithms, as in Table 2.



$$mi = \frac{\text{number of correctly predicted instances}}{\text{number of all test instances}} \tag{17}$$



The baseline is Most Frequent Sense (MFS) which 
always chooses the major class of each word task 
as its prediction. 
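
The measure and the baseline can be illustrated with a toy example. The words, senses and function names below are hypothetical and are not taken from SENSEVAL-2; the point is that correct predictions are pooled over all word tasks before dividing.

```python
from collections import Counter

def micro_average_recall(gold, predicted):
    """Correctly predicted instances divided by all test instances, pooled over tasks."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

def most_frequent_sense(train_labels):
    """MFS baseline: the majority sense in the training data of one word task."""
    return Counter(train_labels).most_common(1)[0][0]

# Toy word tasks with made-up sense labels.
train = {"bank": ["money", "money", "river"], "bar": ["pub", "law", "pub", "pub"]}
test = [("bank", "money"), ("bank", "river"), ("bar", "pub"), ("bar", "pub"), ("bar", "law")]

mfs = {word: most_frequent_sense(labels) for word, labels in train.items()}
gold = [sense for _, sense in test]
pred = [mfs[word] for word, _ in test]
score = micro_average_recall(gold, pred)
```

Here the baseline answers three of the five pooled test instances correctly, giving a micro-average recall of 0.6.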

6.2 Topical Feature 

With the topical feature, the one Nearest Neighbor
algorithm (1-NN) performed below the baseline, and
dimensionality reduction techniques improved the
performance only a little. The polynomial kernel
tended to work better than the Gaussian RBF kernel
in Kernel Principal Component Analysis (KPCA) but
worse in Support Vector Machine (SVM), which only
achieved performance comparable to the baseline.
Among shallow learning algorithms, Linear SVM
worked best, followed by Logistic Regression.
Multilayer Perceptron (MLP), despite having a
hidden layer that enables it to model
nonlinearity, worked worse than Logistic
Regression. This agrees with the argument of
Towell and Voorhees (1998). In spite of the small
sample per class and the noisy feature, Deep
Belief Networks (DBN) achieved the best
performance. DBN outperformed the baseline by
9.65%, Logistic Regression by 2.07%, MLP by 2.24%
and Linear SVM by 1.98%.

6.3 Local Feature 

With the local feature, 1-NN performed well thanks
to the less noisy feature, but dimensionality
reduction techniques did not improve the
performance. The Gaussian RBF kernel tended to
work better than the polynomial kernel in KPCA and
SVM. Naive Bayes (NB) performed worse than with
the topical feature. A possible reason is that its
strong independence assumption agrees with the
bag-of-words feature more than with the binary
feature, since the bag-of-words feature also
assumes independence among words in sentences.
Among shallow learning algorithms, Linear SVM
worked best, followed by Logistic Regression and
MLP. Moreover, they all performed better than with
the topical feature since the local feature is
less noisy. DBN continued to achieve the best
performance, with a score 3.98% higher than with
the topical feature. With this feature, DBN
outperformed the baseline by 13.63%, Logistic
Regression and MLP by 3.97% and Linear SVM by
2.73%.

6.4 Part-of-speech Feature 

With the part-of-speech feature, 1-NN performed
slightly below the baseline. Dimensionality
reduction techniques did not improve the
performance. The Gaussian RBF kernel tended to
work better than the polynomial kernel in KPCA and
SVM. Naive Bayes (NB) achieved its best
performance here, compared to the topical feature
and the local feature. Among shallow learning
algorithms, Logistic Regression worked best,
followed by MLP and linear SVM. The performance
was worse than with the local feature but better
than with the topical feature. DBN continued to
achieve the best performance, with a score 2.32%
higher than with the topical feature but 1.16%
lower than with the local feature. With this
feature, DBN outperformed the baseline by 12.47%,
Logistic Regression by 5.21%, MLP by 5.24% and
Linear SVM by 8.13%. Compared to the other
features, DBN outperformed the shallow learning
algorithms most significantly on this feature set.

Table 1: Micro-average recall (p-value compared to
DBN; lower is better) of various learning
algorithms on the SENSEVAL-2 data set.

6.5 All Feature 

When these three features were combined, 1-NN
performed worse because of noise and scarcity of
the data. NB performed worse than with the local
feature and the part-of-speech feature but still
better than with the topical feature alone.
Logistic Regression and Linear SVM achieved scores
higher than 60%. This shows that adding features
improved the performance of both linear learning
algorithms. However, MLP performed slightly worse
than Logistic Regression. In spite of the small
sample per class and the sparse feature, Deep
Belief Networks (DBN) still achieved the best
performance of 61.30%. DBN outperformed the
baseline by 13.70%, Logistic Regression by 1.23%,
MLP by 1.60% and Linear SVM by 0.90%. This is not
much of an improvement over the local feature
alone. It may be concluded that a basic feature
like the local feature lets DBN achieve a fairly
high performance without adding many features.

7 Summary and Future Work 

We have applied a novel deep learning algorithm,
namely Deep Belief Networks (DBN), which improves
Word Sense Disambiguation (WSD) in terms of
accuracy. We evaluated three knowledge sources and
compared DBN with various state-of-the-art shallow
learning algorithms, both linear and nonlinear.
The experimental results show the superiority of
DBN over many state-of-the-art algorithms
including Support Vector Machine (SVM). From
Table 2, DBN outperformed the baseline, one
Nearest Neighbor (1-NN), dimensionality reduction
techniques (PCA and KPCA), Naive Bayes (NB) and
nonlinear SVM significantly. However, compared to
Logistic Regression, Multilayer Perceptron (MLP)
and Linear SVM, DBN outperformed them
significantly with the part-of-speech feature,
with marginal significance with the local feature,
and with little significance with the topical
feature and all features.

We also found that DBN achieved a relatively high
performance when using only the local feature
while SVM needed more features. This indicates
that deep learning algorithms help us extract
useful properties from the data without excessive
feature engineering. It also shows that applying
deep learning algorithms can be beneficial since
there exists some nonlinearity in WSD data, even
if the data have a small number of instances and a
lot of dimensions. The DBN model that we obtained
by cross validation suggests that if the number of
training instances is small, a small architecture
with a large learning rate can be a good model.

This leads to future work in many directions.
Firstly, sharing representations across word tasks
could help improve the overall task since there
are only a few examples per word task. Secondly,
more knowledge sources, including ones without
labels, may be incorporated. We will investigate
these directions in the future.